PM2.5 Trends in California

Author

Hanin Almodaweb

Project Description

I will work with air pollution data from the U.S. Environmental Protection Agency (EPA), which maintains a national network of air pollution monitoring sites. The primary question I will answer is whether daily concentrations of PM\(_{2.5}\) (particulate matter air pollution with aerodynamic diameter less than 2.5 \(\mu\)m) have decreased in California over the last 20 years (from 2002 to 2022).

Exploratory Data Analysis

Summary of 2002 Findings

The 2002 data set consists of 15,976 rows and 22 columns (variables), with no apparent problems in the headers or footers. Initial checks of the data structure indicated a mix of character, integer, and numeric data types. The variable names are Date, Source, Site ID, POC, Daily Mean PM\(_{2.5}\) Concentration, Units, Daily AQI Value, Local Site Name, Daily Obs Count, Percent Complete, AQS Parameter Code, AQS Parameter Description, Method Code, Method Description, CBSA Code, CBSA Name, State FIPS Code, State, County FIPS Code, County, Site Latitude, and Site Longitude. The character variables of interest are Date, State, and County, while the numeric variables under study are Daily Mean PM\(_{2.5}\) Concentration, Site Latitude, and Site Longitude.

The Daily Mean PM\(_{2.5}\) Concentration values range from 0 to 104.3 µg/m³, with a mean of 16.12 µg/m³ and a median of 12 µg/m³. There are no missing values in this column, so the key variable of interest is complete for analysis. Other columns do contain missing values, however, and these require further investigation to ensure data quality. A closer examination of missing data patterns and potential outliers, particularly in the PM\(_{2.5}\) measurements, is necessary to identify any inconsistencies.
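The structure and missingness checks described above can be sketched as follows. This is a minimal illustration on a three-row toy table (`toy` is a hypothetical stand-in, not the actual 2002 file, and its values are chosen only to show the calls); it assumes the data.table package is installed.

```r
library(data.table)

# Toy stand-in for the 2002 file; the real data set has 22 columns
toy <- data.table(
  Date = c("01/05/2002", "01/06/2002", "01/08/2002"),
  `Daily Mean PM2.5 Concentration` = c(25.1, 31.6, 21.4),
  `CBSA Code` = c(41860L, NA, 41860L)
)

dim(toy)                                        # number of rows and columns
str(toy)                                        # data type of each column
summary(toy$`Daily Mean PM2.5 Concentration`)   # range, mean, median

# Number of missing values in each column
na_counts <- sapply(toy, function(col) sum(is.na(col)))
print(na_counts)
```

Counting missing values per column, rather than only in the PM\(_{2.5}\) column, makes it easier to see where the gaps are.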

Summary of 2022 Findings

The 2022 data set contains 59,756 rows and 22 columns (variables), with the headers and footers loaded correctly. There is evidence of missing data, though not in the main variable of interest, Daily Mean PM\(_{2.5}\) Concentration. The variable names and types remain consistent with the 2002 data set. The Daily Mean PM\(_{2.5}\) Concentration values range from -6.7 to 302.5 µg/m³, with a mean of 8.43 µg/m³ and a median of 6.8 µg/m³. Negative values are worth noting: PM\(_{2.5}\) is a physical measurement of particulate mass in the air, so a concentration below zero is not meaningful. A negative value might indicate an issue with the data collection, sensor calibration, or data processing.

Data Analysis

  1. Combine the two years of data into one data frame. Use the Date variable to create a new column for year, which will serve as an identifier. Change the names of the key variables so that they are easier to refer to in your code.
# combining the two data sets
EPA_combined <- rbind(EPA2002, EPA2022, fill = TRUE)
# converting date to date format
EPA_combined$Date <- as.Date(EPA_combined$Date, format = "%m/%d/%Y")

# creating a 'Year' column from the date
EPA_combined$Year <- format(EPA_combined$Date, "%Y")
# renaming the columns of key variables
setnames(EPA_combined, old = c("Daily Mean PM2.5 Concentration", "Daily AQI Value", 
                          "Site ID", "Site Latitude", "Site Longitude"), 
                 new = c("PM2.5", "AQI", "Site_ID", "Latitude", "Longitude"))
# checking the new data set
summary(EPA_combined)
      Date               Source             Site_ID              POC        
 Min.   :2002-01-01   Length:75732       Min.   :60010007   Min.   : 1.000  
 1st Qu.:2022-01-19   Class :character   1st Qu.:60290016   1st Qu.: 1.000  
 Median :2022-05-14   Mode  :character   Median :60612003   Median : 3.000  
 Mean   :2018-04-13                      Mean   :60560422   Mean   : 3.309  
 3rd Qu.:2022-09-09                      3rd Qu.:60731022   3rd Qu.: 3.000  
 Max.   :2022-12-31                      Max.   :61131003   Max.   :24.000  
                                                                            
     PM2.5           Units                AQI        Local Site Name   
 Min.   : -6.70   Length:75732       Min.   :  0.0   Length:75732      
 1st Qu.:  4.50   Class :character   1st Qu.: 25.0   Class :character  
 Median :  7.60   Mode  :character   Median : 42.0   Mode  :character  
 Mean   : 10.05                      Mean   : 43.5                     
 3rd Qu.: 12.20                      3rd Qu.: 57.0                     
 Max.   :302.50                      Max.   :454.0                     
                                                                       
 Daily Obs Count Percent Complete AQS Parameter Code AQS Parameter Description
 Min.   :1       Min.   :100      Min.   :88101      Length:75732             
 1st Qu.:1       1st Qu.:100      1st Qu.:88101      Class :character         
 Median :1       Median :100      Median :88101      Mode  :character         
 Mean   :1       Mean   :100      Mean   :88197                               
 3rd Qu.:1       3rd Qu.:100      3rd Qu.:88101                               
 Max.   :1       Max.   :100      Max.   :88502                               
                                                                              
  Method Code    Method Description   CBSA Code      CBSA Name        
 Min.   :117.0   Length:75732       Min.   :12540   Length:75732      
 1st Qu.:170.0   Class :character   1st Qu.:31080   Class :character  
 Median :170.0   Mode  :character   Median :40140   Mode  :character  
 Mean   :327.8                      Mean   :34595                     
 3rd Qu.:707.0                      3rd Qu.:41740                     
 Max.   :810.0                      Max.   :49700                     
                                    NA's   :5496                      
 State FIPS Code    State           County FIPS Code    County         
 Min.   :6       Length:75732       Min.   :  1.00   Length:75732      
 1st Qu.:6       Class :character   1st Qu.: 29.00   Class :character  
 Median :6       Mode  :character   Median : 61.00   Mode  :character  
 Mean   :6                          Mean   : 55.89                     
 3rd Qu.:6                          3rd Qu.: 73.00                     
 Max.   :6                          Max.   :113.00                     
                                                                       
    Latitude       Longitude          Year          
 Min.   :32.58   Min.   :-124.2   Length:75732      
 1st Qu.:34.07   1st Qu.:-121.4   Class :character  
 Median :36.48   Median :-119.3   Mode  :character  
 Mean   :36.19   Mean   :-119.5                     
 3rd Qu.:37.96   3rd Qu.:-117.9                     
 Max.   :41.76   Max.   :-115.5                     
                                                    
head(EPA_combined)
         Date Source  Site_ID   POC PM2.5    Units   AQI Local Site Name
       <Date> <char>    <int> <int> <num>   <char> <int>          <char>
1: 2002-01-05    AQS 60010007     1  25.1 ug/m3 LC    81       Livermore
2: 2002-01-06    AQS 60010007     1  31.6 ug/m3 LC    93       Livermore
3: 2002-01-08    AQS 60010007     1  21.4 ug/m3 LC    74       Livermore
4: 2002-01-11    AQS 60010007     1  25.9 ug/m3 LC    82       Livermore
5: 2002-01-14    AQS 60010007     1  34.5 ug/m3 LC    98       Livermore
6: 2002-01-17    AQS 60010007     1  41.0 ug/m3 LC   115       Livermore
   Daily Obs Count Percent Complete AQS Parameter Code
             <int>            <num>              <int>
1:               1              100              88101
2:               1              100              88101
3:               1              100              88101
4:               1              100              88101
5:               1              100              88101
6:               1              100              88101
   AQS Parameter Description Method Code                    Method Description
                      <char>       <int>                                <char>
1:  PM2.5 - Local Conditions         120 Andersen RAAS2.5-300 PM2.5 SEQ w/WINS
2:  PM2.5 - Local Conditions         120 Andersen RAAS2.5-300 PM2.5 SEQ w/WINS
3:  PM2.5 - Local Conditions         120 Andersen RAAS2.5-300 PM2.5 SEQ w/WINS
4:  PM2.5 - Local Conditions         120 Andersen RAAS2.5-300 PM2.5 SEQ w/WINS
5:  PM2.5 - Local Conditions         120 Andersen RAAS2.5-300 PM2.5 SEQ w/WINS
6:  PM2.5 - Local Conditions         120 Andersen RAAS2.5-300 PM2.5 SEQ w/WINS
   CBSA Code                         CBSA Name State FIPS Code      State
       <int>                            <char>           <int>     <char>
1:     41860 San Francisco-Oakland-Hayward, CA               6 California
2:     41860 San Francisco-Oakland-Hayward, CA               6 California
3:     41860 San Francisco-Oakland-Hayward, CA               6 California
4:     41860 San Francisco-Oakland-Hayward, CA               6 California
5:     41860 San Francisco-Oakland-Hayward, CA               6 California
6:     41860 San Francisco-Oakland-Hayward, CA               6 California
   County FIPS Code  County Latitude Longitude   Year
              <int>  <char>    <num>     <num> <char>
1:                1 Alameda 37.68753 -121.7842   2002
2:                1 Alameda 37.68753 -121.7842   2002
3:                1 Alameda 37.68753 -121.7842   2002
4:                1 Alameda 37.68753 -121.7842   2002
5:                1 Alameda 37.68753 -121.7842   2002
6:                1 Alameda 37.68753 -121.7842   2002
tail(EPA_combined)
         Date Source  Site_ID   POC PM2.5    Units   AQI      Local Site Name
       <Date> <char>    <int> <int> <num>   <char> <int>               <char>
1: 2022-12-01    AQS 61131003     1   3.4 ug/m3 LC    19 Woodland-Gibson Road
2: 2022-12-07    AQS 61131003     1   3.8 ug/m3 LC    21 Woodland-Gibson Road
3: 2022-12-13    AQS 61131003     1   6.0 ug/m3 LC    33 Woodland-Gibson Road
4: 2022-12-19    AQS 61131003     1  34.8 ug/m3 LC    99 Woodland-Gibson Road
5: 2022-12-25    AQS 61131003     1  23.2 ug/m3 LC    77 Woodland-Gibson Road
6: 2022-12-31    AQS 61131003     1   1.0 ug/m3 LC     6 Woodland-Gibson Road
   Daily Obs Count Percent Complete AQS Parameter Code
             <int>            <num>              <int>
1:               1              100              88101
2:               1              100              88101
3:               1              100              88101
4:               1              100              88101
5:               1              100              88101
6:               1              100              88101
   AQS Parameter Description Method Code
                      <char>       <int>
1:  PM2.5 - Local Conditions         145
2:  PM2.5 - Local Conditions         145
3:  PM2.5 - Local Conditions         145
4:  PM2.5 - Local Conditions         145
5:  PM2.5 - Local Conditions         145
6:  PM2.5 - Local Conditions         145
                                      Method Description CBSA Code
                                                  <char>     <int>
1: R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
2: R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
3: R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
4: R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
5: R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
6: R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
                                 CBSA Name State FIPS Code      State
                                    <char>           <int>     <char>
1: Sacramento--Roseville--Arden-Arcade, CA               6 California
2: Sacramento--Roseville--Arden-Arcade, CA               6 California
3: Sacramento--Roseville--Arden-Arcade, CA               6 California
4: Sacramento--Roseville--Arden-Arcade, CA               6 California
5: Sacramento--Roseville--Arden-Arcade, CA               6 California
6: Sacramento--Roseville--Arden-Arcade, CA               6 California
   County FIPS Code County Latitude Longitude   Year
              <int> <char>    <num>     <num> <char>
1:              113   Yolo 38.66121 -121.7327   2022
2:              113   Yolo 38.66121 -121.7327   2022
3:              113   Yolo 38.66121 -121.7327   2022
4:              113   Yolo 38.66121 -121.7327   2022
5:              113   Yolo 38.66121 -121.7327   2022
6:              113   Yolo 38.66121 -121.7327   2022
  2. Create a basic map in leaflet() that shows the locations of the sites (make sure to use different colors for each year). Summarize the spatial distribution of the monitoring sites.
# ensuring the data is in the correct format
EPA_combined$Year <- as.numeric(EPA_combined$Year)

# defining a color palette for the years (2002 and 2022)
palette <- colorFactor(palette = c("turquoise", "pink"), domain = EPA_combined$Year)

# creating the leaflet map
leaflet(EPA_combined) %>%
  addTiles() %>%  
  addCircleMarkers(
    ~Longitude, ~Latitude,  # Set the longitude and latitude
    color = ~palette(Year), # Use different colors for each year
    popup = ~paste("Site ID:", Site_ID, "<br>", 
                   "Year:", Year, "<br>",
                   "PM2.5:", PM2.5, "<br>",
                   "AQI:", AQI),  # Popup information
    radius = 5, fillOpacity = 0.8, stroke = FALSE
  ) %>%
  addLegend(
    "bottomright", 
    pal = palette, 
    values = ~Year, 
    title = "Monitoring Year",
    opacity = 1
  )
Summary of the spatial distribution of the monitoring sites

In 2002, monitoring sites were mainly concentrated around major cities like Los Angeles, San Francisco, and Sacramento, with less coverage in central and eastern regions. By 2022, the number of monitoring sites increased, especially in previously underrepresented areas, indicating an expansion of air quality monitoring infrastructure over the two decades.
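The claim that the number of sites grew can be checked by counting distinct site IDs per year. The sketch below uses a small made-up table (`toy_sites` and its rows are hypothetical) with the same `Site_ID` and `Year` column names as the combined data, and assumes data.table is available.

```r
library(data.table)

# Hypothetical rows: several daily records can share one site
toy_sites <- data.table(
  Site_ID = c(60010007L, 60010007L, 60010007L, 61131003L, 60371103L),
  Year    = c(2002, 2002, 2022, 2022, 2022)
)

# uniqueN() counts distinct values, so repeated daily rows are not double counted
sites_per_year <- toy_sites[, .(n_sites = uniqueN(Site_ID)), by = Year]
print(sites_per_year)
```

Applied to the combined data, the same call would give the number of distinct monitoring sites in each year, which is a more direct measure of network expansion than the number of rows.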

  3. Check for any missing or implausible values of PM\(_{2.5}\) in the combined dataset. Explore the proportions of each and provide a summary of any temporal patterns you see in these observations.
# checking for missing values in PM2.5
missing_PM25 <- EPA_combined[is.na(PM2.5), .N]

# checking for implausible values (negative values or values above 500 ug/m^3, as given by the 2012 EPA)
implausible_PM25 <- EPA_combined[PM2.5 < 0 | PM2.5 > 500, .N]

# total number of observations
total_obs <- nrow(EPA_combined)

# calculating proportions of missing and implausible values
prop_missing <- missing_PM25 / total_obs
prop_implausible <- implausible_PM25 / total_obs

# summary of findings
cat("Total Observations:", total_obs, "\n")
Total Observations: 75732 
cat("Missing PM2.5 Values:", missing_PM25, "(", round(prop_missing * 100, 2), "% )\n")
Missing PM2.5 Values: 0 ( 0 % )
cat("Implausible PM2.5 Values:", implausible_PM25, "(", round(prop_implausible * 100, 2), "% )\n")
Implausible PM2.5 Values: 215 ( 0.28 % )
# exploring temporal patterns in missing and implausible values
missing_by_year <- EPA_combined[is.na(PM2.5), .N, by = Year]
implausible_by_year <- EPA_combined[PM2.5 < 0 | PM2.5 > 500, .N, by = Year]
# displaying the missing and implausible values by year
missing_by_year
Empty data.table (0 rows and 2 cols): Year,N
implausible_by_year
    Year     N
   <num> <int>
1:  2022   215
# examining frequency of implausible values by month
implausible_values <- subset(EPA_combined, PM2.5 < 0 | PM2.5 > 500)

# extracting month from the Date column
implausible_values$Month <- format(as.Date(implausible_values$Date), "%Y-%m")

# creating a table or summary of the count of implausible values by month
implausible_by_month <- table(implausible_values$Month)

# converting to a data frame for easier plotting or viewing
implausible_by_month_df <- as.data.frame(implausible_by_month)

# view the distribution
print(implausible_by_month_df)
      Var1 Freq
1  2022-01   23
2  2022-02   18
3  2022-03    8
4  2022-04    4
5  2022-05   12
6  2022-06   19
7  2022-07   27
8  2022-08    7
9  2022-09   21
10 2022-10    4
11 2022-11   26
12 2022-12   46
Summary of temporal patterns

The combined dataset has a total of 75,732 observations with no missing values for PM\(_{2.5}\) (a missing proportion of 0%). However, there are 215 implausible values (0.28%), defined as PM\(_{2.5}\) concentrations below 0 or above 500 µg/m³, the upper bound given by the 2012 EPA. Because the maximum recorded value is 302.5 µg/m³, all 215 implausible values are negative readings. All of them occurred in 2022; none were found in 2002. By month, the implausible values were spread throughout the year, with the highest counts in December (46) and July (27) and the lowest in April and October (4 each).
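Monthly counts depend on how many observations each month has, so it can help to express the negative readings as a share of each month's observations. The following is a minimal sketch on made-up values (`toy_pm` is hypothetical), assuming data.table is available.

```r
library(data.table)

# Hypothetical daily readings, some of them negative
toy_pm <- data.table(
  Date  = as.Date(c("2022-01-03", "2022-01-15", "2022-01-20",
                    "2022-07-04", "2022-07-09")),
  PM2.5 = c(-1.2, 5.0, 7.3, -0.4, -2.0)
)

# Label each row with its year-month
toy_pm[, Month := format(Date, "%Y-%m")]

# Count and percentage of negative readings within each month
rate_by_month <- toy_pm[, .(
  n_obs        = .N,
  n_negative   = sum(PM2.5 < 0),
  pct_negative = round(100 * mean(PM2.5 < 0), 1)
), by = Month]
print(rate_by_month)
```

A month with many negative readings but also many observations may have a lower rate than its raw count suggests.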

  4. Explore the main question of interest at three different spatial levels. Create exploratory plots (e.g. boxplots, histograms, line plots) and summary statistics that best suit each level of data. Be sure to write up explanations of what you observe in these data.
  • State
# sub-setting for California data
california_data <- EPA_combined[State == "California"]

# summary statistics for PM2.5 in California across years
summary_stats_state <- california_data %>%
  group_by(Year) %>%
  summarize(
    mean_PM2.5 = mean(PM2.5, na.rm = TRUE),
    median_PM2.5 = median(PM2.5, na.rm = TRUE),
    sd_PM2.5 = sd(PM2.5, na.rm = TRUE),
    min_PM2.5 = min(PM2.5, na.rm = TRUE),         
    max_PM2.5 = max(PM2.5, na.rm = TRUE),         
    count = n()   
  )

# printing the summary statistics
print(summary_stats_state)
# A tibble: 2 × 7
   Year mean_PM2.5 median_PM2.5 sd_PM2.5 min_PM2.5 max_PM2.5 count
  <dbl>      <dbl>        <dbl>    <dbl>     <dbl>     <dbl> <int>
1  2002      16.1          12      13.9        0        104. 15976
2  2022       8.43          6.8     7.64      -6.7      302. 59756
# histogram of PM2.5 by year
ggplot(data = california_data) + 
  geom_histogram(aes(x = PM2.5, fill = as.factor(Year)), 
                 position = "identity", alpha = 0.6, binwidth = 2) +
  labs(title = "PM2.5 by Year in California", x = "Daily Mean PM2.5 Concentration (µg/m³)", 
       fill = "Year") +
  theme_minimal()

# boxplot of PM2.5 by year
ggplot(california_data, aes(x = as.factor(Year), y = PM2.5)) +
  geom_boxplot(fill = "pink", color = "purple", alpha = 0.7) +
  labs(title = "PM2.5 Concentrations by Year in California (2002-2022)",
       x = "Year",
       y = "Daily Mean PM2.5 Concentration (µg/m³)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# creating a summary data.table for highlighted years
highlight_years <- california_data[Year %in% c(2002, 2022), .(Mean_PM2.5 = mean(PM2.5, na.rm = TRUE)), by = Year]

# creating the line plot
ggplot(california_data, aes(x = Year, y = PM2.5)) +
  # Line plot for average PM2.5 using linewidth
  geom_line(stat = "summary", fun = mean, color = "pink", linewidth = 1) +
  # adding points for highlighted years
  geom_point(data = highlight_years, aes(x = Year, y = Mean_PM2.5), 
             size = 3, color = "purple", fill = "purple", shape = 21) +
  # adding circles around the points for emphasis
  geom_point(data = highlight_years, aes(x = Year, y = Mean_PM2.5), 
             size = 6, color = "purple", shape = 1) +
  labs(title = "Average PM2.5 Concentration Over Time in California (2002-2022)",
       x = "Year",
       y = "Average Daily Mean PM2.5 (µg/m³)") +
  theme_minimal()

Summary of observations

The data indicate a substantial decrease in daily PM\(_{2.5}\) concentrations in California from 2002 to 2022. The mean concentration dropped from 16.12 µg/m³ to 8.43 µg/m³, a reduction of roughly 48%, and the median fell from 12 µg/m³ to 6.8 µg/m³. The spread of values also narrowed (standard deviation 13.9 versus 7.64 µg/m³), suggesting fewer extreme pollution days. While 2022 still had occasional high pollution events, including the overall maximum of 302.5 µg/m³, most days showed much lower PM\(_{2.5}\) levels than in 2002. This trend is consistent with advancements in air quality management and pollution control measures over the past two decades, although these data alone do not establish the cause.
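The size of the statewide reduction follows directly from the two means reported in the summary table. A short worked calculation (variable names are illustrative):

```r
# State-level means from the summary statistics (µg/m³)
mean_2002 <- 16.12
mean_2022 <- 8.43

# Absolute and relative change between the two years
abs_change    <- mean_2002 - mean_2022
pct_reduction <- 100 * abs_change / mean_2002

round(abs_change, 2)      # 7.69
round(pct_reduction, 1)   # 47.7
```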

  • County
# summary statistics for PM2.5 by counties in California across years 
summary_stats_county <- EPA_combined %>%
  group_by(County, Year) %>%
  summarize(
    mean_PM2.5 = mean(PM2.5, na.rm = TRUE),
    median_PM2.5 = median(PM2.5, na.rm = TRUE),
    sd_PM2.5 = sd(PM2.5, na.rm = TRUE),
    min_PM2.5 = min(PM2.5, na.rm = TRUE),         
    max_PM2.5 = max(PM2.5, na.rm = TRUE),         
    count = n(),                                   
    .groups = "drop"  # return an ungrouped tibble
  ) %>%
  arrange(County, Year)

# printing the summary statistics
print(summary_stats_county)
# A tibble: 98 × 8
   County        Year mean_PM2.5 median_PM2.5 sd_PM2.5 min_PM2.5 max_PM2.5 count
   <chr>        <dbl>      <dbl>        <dbl>    <dbl>     <dbl>     <dbl> <int>
 1 Alameda       2002      14.3          10      11.4        1.9      61.6   201
 2 Alameda       2022       8.20          7       4.95      -0.7      35.5  1793
 3 Butte         2002      14.8          11.5    11.7        1        88     473
 4 Butte         2022       6.19          4.5     5.79      -0.6      42.8  1121
 5 Calaveras     2002       9.9           8       6.50       2        40      60
 6 Calaveras     2022       6.04          5       4.10       0        25.9   355
 7 Colusa        2002      11.7           9      10.0        1        57      95
 8 Colusa        2022       7.61          6.7     4.76       0.6      37     401
 9 Contra Costa  2002      15.1           9.5    14.5        2        76.7   276
10 Contra Costa  2022       8.25          7.3     4.92       0.9      37.3   817
# ℹ 88 more rows
# ensuring 'Year' is treated as a factor
EPA_combined$Year <- as.factor(EPA_combined$Year)

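# creating a bar plot of the county-level means computed above
# (plotting the raw daily values with stat = "identity" would overlay every
#  observation, so each bar would show the maximum rather than the mean)
ggplot(data = summary_stats_county,
       aes(x = County, y = mean_PM2.5, fill = as.factor(Year))) +
  geom_col(position = "dodge") +
  scale_fill_manual(values = c("2002" = "turquoise", "2022" = "pink")) +
  labs(title = "PM2.5 Trends by County (2002 vs 2022)",
       x = "County",
       y = "Mean Daily PM2.5 (µg/m³)",
       fill = "Year") +
  coord_flip()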

Summary of observations

From 2002 to 2022, air quality in California’s counties, measured by PM\(_{2.5}\) levels, generally improved. For example, Alameda County’s mean PM\(_{2.5}\) fell from 14.25 µg/m³ to 8.20 µg/m³, and Butte County’s mean decreased from 14.76 µg/m³ to 6.19 µg/m³. Similar downward trends were observed in other counties, such as Fresno, where the mean dropped from 19.93 µg/m³ to 10.19 µg/m³. The decrease in both mean and maximum PM\(_{2.5}\) values indicates improved air quality, although variability persisted in some areas with occasional spikes, such as Trinity and Placer counties. Overall, air quality across the state improved markedly, with fewer high pollution days in 2022 than in 2002.
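To compare counties directly, the long summary table can be reshaped so that each county has one row with both years side by side. The sketch below uses only the Alameda and Butte means quoted above; `county_means`, `wide`, and the new column names are illustrative, and data.table is assumed to be available.

```r
library(data.table)

# County means quoted in the text (µg/m³)
county_means <- data.table(
  County     = c("Alameda", "Alameda", "Butte", "Butte"),
  Year       = c(2002, 2022, 2002, 2022),
  mean_PM2.5 = c(14.25, 8.20, 14.76, 6.19)
)

# One row per county, one column per year
wide <- dcast(county_means, County ~ Year, value.var = "mean_PM2.5")
setnames(wide, c("2002", "2022"), c("mean_2002", "mean_2022"))

# Absolute and percentage change from 2002 to 2022
wide[, change := mean_2022 - mean_2002]
wide[, pct_change := round(100 * change / mean_2002, 1)]

# Largest decreases first
print(wide[order(change)])
```

A negative `change` means the county's mean fell. Sorting by it shows which counties improved most, and any counties with a positive value would be ones where the mean rose.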

  • Sites in Los Angeles
# sub-setting Los Angeles site data
la_data <- EPA_combined[County == "Los Angeles"]

# summary statistics for PM2.5 in Los Angeles sites across years
summary_stats_la <- la_data %>%
  group_by(Year) %>%
  summarize(
    mean_PM2.5 = mean(PM2.5, na.rm = TRUE),
    median_PM2.5 = median(PM2.5, na.rm = TRUE),
    sd_PM2.5 = sd(PM2.5, na.rm = TRUE),
    n = n()
  )

# printing the summary statistics
print(summary_stats_la)
# A tibble: 2 × 5
  Year  mean_PM2.5 median_PM2.5 sd_PM2.5     n
  <fct>      <dbl>        <dbl>    <dbl> <int>
1 2002        19.7         17.4    11.9   1879
2 2022        11.0         10.3     5.24  5070
# histogram of PM2.5 in Los Angeles sites
ggplot(la_data, aes(x = PM2.5, fill = Year)) +
  geom_histogram(binwidth = 2, color = "pink", alpha = 0.7, position = "identity") +
  labs(title = "Distribution of Daily Mean PM2.5 Concentrations at Los Angeles sites (2002-2022)",
       x = "Daily Mean PM2.5 Concentration (µg/m³)",
       y = "Frequency") +
  scale_fill_manual(values = c("2002" = "turquoise", "2022" = "pink")) + # Custom colors for each year
  theme_minimal() +
  theme(legend.position = "top")

# loading gridExtra library 
library(gridExtra)

Attaching package: 'gridExtra'
The following object is masked from 'package:dplyr':

    combine
# Splitting data into 2002 and 2022 subsets
LA_2002 <- subset(la_data, Year == 2002)
LA_2022 <- subset(la_data, Year == 2022)

# Rebuild each date with its year set explicitly
LA_2002$Date <- as.Date(paste("2002", format(LA_2002$Date, "%m-%d"), sep = "-"))
LA_2022$Date <- as.Date(paste("2022", format(LA_2022$Date, "%m-%d"), sep = "-"))

# Sort each subset by date
LA_2002 <- LA_2002[order(LA_2002$Date), ]
LA_2022 <- LA_2022[order(LA_2022$Date), ]

# Plotting PM2.5 levels for 2002
plot_2002 <- ggplot(LA_2002, aes(x = Date, y = PM2.5)) +
  geom_line(color = "turquoise") +
  geom_point(color = "turquoise") +
  scale_x_date(date_labels = "%b", date_breaks = "1 month") +  # Set month labels
  labs(title = "Change in PM2.5 in Los Angeles in 2002", x = "Month in 2002", y = "Daily Mean PM2.5 Concentration (µg/m³)") +
  theme_minimal()

# Plotting PM2.5 levels for 2022
plot_2022 <- ggplot(LA_2022, aes(x = Date, y = PM2.5)) +
  geom_line(color = "pink") +
  geom_point(color = "pink") +
  scale_x_date(date_labels = "%b", date_breaks = "1 month") +  # Set month labels
  labs(title = "Change in PM2.5 in Los Angeles in 2022", x = "Month in 2022", y = "Daily Mean PM2.5 Concentration (µg/m³)") +
  theme_minimal()

# Arrange both plots side-by-side
grid.arrange(plot_2002, plot_2022, ncol = 2)

Summary of observations

In Los Angeles County, air quality improved substantially from 2002 to 2022, as indicated by a decrease in PM\(_{2.5}\) levels. In 2002, the mean PM\(_{2.5}\) was 19.66 µg/m³, with a median of 17.4 µg/m³ and a standard deviation of 11.88 µg/m³, based on 1,879 observations. By 2022, the mean had dropped to 10.97 µg/m³ (a reduction of about 44%), with a median of 10.3 µg/m³ and a standard deviation of 5.24 µg/m³, based on a larger set of 5,070 observations.
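The figures above pool all Los Angeles sites together. A per-site breakdown could be sketched as below, grouping by `Site_ID` as well as `Year`. The site IDs and values in `toy_la` are made up for illustration and are not taken from the data; data.table is assumed to be available.

```r
library(data.table)

# Hypothetical Los Angeles records (IDs and values are illustrative only)
toy_la <- data.table(
  Site_ID = c(60370002L, 60370002L, 60371103L, 60371103L, 60371103L),
  Year    = c(2002, 2022, 2002, 2002, 2022),
  PM2.5   = c(20.0, 11.0, 18.0, 22.0, 9.0)
)

# Mean concentration and number of observations for each site and year
site_summary <- toy_la[, .(mean_PM2.5 = mean(PM2.5), n = .N),
                       by = .(Site_ID, Year)]
setorder(site_summary, Site_ID, Year)
print(site_summary)
```

Sites that appear in only one of the two years would show up with a single row, which helps separate real changes at a site from changes in which sites were operating.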